The Annals of Applied Statistics — Latest Matching Preprints

1

Filter inference: A scalable nonlinear mixed effects inference approach for snapshot time series data

Augustin, D.; Lambert, B.; Wang, K.; Walz, A.-C.; Robinson, M.; Gavaghan, D.

2022-11-02 bioinformatics 10.1101/2022.11.01.514702 medRxiv

Top 0.1%

10.0%

Show abstract

Variability is an intrinsic property of biological systems and is often at the heart of their complex behaviour. Examples range from cell-to-cell variability in cell signalling pathways to variability in the response to treatment across patients. A popular approach to model and understand this variability is nonlinear mixed effects (NLME) modelling. However, estimating the parameters of NLME models from measurements quickly becomes computationally expensive as the number of measured individuals grows, making NLME inference intractable for datasets with thousands of measured individuals. This shortcoming is particularly limiting for snapshot datasets, common e.g. in cell biology, where high-throughput measurement techniques provide large numbers of single cell measurements. We extend earlier work by Hasenauer et al (2011) to introduce a novel approach for the estimation of NLME model parameters from snapshot measurements, which we call filter inference. Filter inference is a new variant of approximate Bayesian computation, with dominant computational costs that do not increase with the number of measured individuals, making efficient inferences from snapshot measurements possible. Filter inference also scales well with the number of model parameters, using state-of-the-art gradient-based MCMC algorithms, such as the No-U-Turn Sampler (NUTS). We demonstrate the properties of filter inference using examples from early cancer growth modelling and from epidermal growth factor signalling pathway modelling. Author summaryNonlinear mixed effects (NLME) models are widely used to model differences between individuals in a population. In pharmacology, for example, they are used to model the treatment response variability across patients, and in cell biology they are used to model the cell-to-cell variability in cell signalling pathways. However, NLME models introduce parameters, which typically need to be estimated from data. This estimation becomes computationally intractable when the number of measured individuals - be they patients or cells - is too large. But, the more individuals are measured in a population, the better the variability can be understood. This is especially true when individuals are measured only once. Such snapshot measurements are particularly common in cell biology, where high-throughput measurement techniques provide large numbers of single cell measurements. In clinical pharmacology, datasets consisting of many snapshot measurements are less common but are easier and cheaper to obtain than detailed time series measurements across patients. Our approach can be used to estimate the parameters of NLME models from snapshot time series data with thousands of measured individuals.

2

Generative AI-assisted Bayesian-frequentist Hybrid Inference in Single-cell RNA Sequencing Analysis for Genes Associated with Alzheimer's Disease

Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.

2026-04-20 geriatric medicine 10.64898/2026.04.17.26351142 medRxiv

Top 0.1%

6.9%

Show abstract

Alzheimers disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimers disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.

3

Discriminative Bayesian Serology: Counting Without Cutoffs

Christian, M.; Murrell, B.

2020-07-14 immunology 10.1101/2020.07.14.202150 medRxiv

Top 0.1%

6.7%

Show abstract

During the emergence of a pandemic, we need to estimate the prevalence of a disease using serological assays whose characterization is incomplete, relying on limited validation data. This introduces uncertainty for which we need to account. In our treatment, the data take the form of continuous assay measurements of antibody response to antigens (eg. ELISA), and fall into two groups. The training data includes the confirmed positive or negative infection status for each sample. The population data includes only the assay measurements, and is assumed to be a random sample from the population from which we estimate the seroprevalence. We use the training data to model the relationship between assay values and infection status, capturing both individual-level uncertainty in infection status, as well as uncertainty due to limited training data. We then estimate the posterior distribution over population prevalence, additionally capturing uncertainty due to finite samples. Finally, we introduce a means to pool information over successive time points, using a Gaussian process, which dramatically reduces the variance of our estimates. The methodological approach we here describe was developed to support the longitudinal characterization of the seroprevalence of COVID-19 in Stockholm, Sweden.

4

Rapid Estimation of SNP Heritability using Predictive Process approximation in Large scale Cohort Studies

Seal, S.; Datta, A.; Basu, S.

2021-05-14 bioinformatics 10.1101/2021.05.12.443931 medRxiv

Top 0.1%

6.2%

Show abstract

With the advent of high throughput genetic data, there have been attempts to estimate heritability from genome-wide SNP data on a cohort of distantly related individuals using linear mixed model (LMM). Fitting such an LMM in a large scale cohort study, however, is tremendously challenging due to its high dimensional linear algebraic operations. In this paper, we propose a new method named PredLMM approximating the aforementioned LMM motivated by the concepts of genetic coalescence and gaussian predictive process. PredLMM has substantially better computational complexity than most of the existing LMM based methods and thus, provides a fast alternative for estimating heritability in large scale cohort studies. Theoretically, we show that under a model of genetic coalescence, the limiting form of our approximation is the celebrated predictive process approximation of large gaussian process likelihoods that has well-established accuracy standards. We illustrate our approach with extensive simulation studies and use it to estimate the heritability of multiple quantitative traits from the UK Biobank cohort.

5

Bayesian semi-nonnegative matrix tri-factorization to identify pathways associated with cancer phenotypes

Park, S.; Kar, N.; Cheong, J.-H.; Hwang, T. H.

2019-08-20 bioinformatics 10.1101/739110 medRxiv

Top 0.1%

5.1%

Show abstract

Accurate identification of pathways associated with cancer phenotypes (e.g., cancer sub-types and treatment outcome) could lead to discovering reliable prognostic and/or predictive biomarkers for better patients stratification and treatment guidance. In our previous work, we have shown that non-negative matrix tri-factorization (NMTF) can be successfully applied to identify pathways associated with specific cancer types or disease classes as a prognostic and predictive biomarker. However, one key limitation of non-negative factorization methods, including various non-negative bi-factorization methods, is their lack of ability to handle non-negative input data. For example, many molecular data that consist of real-values containing both positive and negative values (e.g., normalized/log transformed gene expression data where negative value represents down-regulated expression of genes) are not suitable input for these algorithms. In addition, most previous methods provide just a single point estimate and hence cannot deal with uncertainty effectively.\n\nTo address these limitations, we propose a Bayesian semi-nonnegative matrix trifactorization method to identify pathways associated with cancer phenotypes from a realvalued input matrix, e.g., gene expression values. Motivated by semi-nonnegative factorization, we allow one of the factor matrices, the centroid matrix, to be real-valued so that each centroid can express either the up- or down-regulation of the member genes in a pathway. In addition, we place structured spike-and-slab priors (which are encoded with the pathways and a gene-gene interaction (GGI) network) on the centroid matrix so that even a set of genes that is not initially contained in the pathways (due to the incompleteness of the current pathway database) can be involved in the factorization in a stochastic way specifically, if those genes are connected to the member genes of the pathways on the GGI network. We also present update rules for the posterior distributions in the framework of variational inference. As a full Bayesian method, our proposed method has several advantages over the current NMTF methods which are demonstrated using synthetic datasets in experiments. Using the The Cancer Genome Atlas (TCGA) gastric cancer and metastatic gastric cancer immunotherapy clinical-trial datasets, we show that our method could identify biologically and clinically relevant pathways associated with the molecular sub-types and immunotherapy response, respectively. Finally, we show that those pathways identified by the proposed method could be used as prognostic biomarkers to stratify patients with distinct survival outcome in two independent validation datasets. Additional information and codes can be found at https://github.com/parks-cs-ccf/BayesianSNMTF.

6

A semi-supervised Bayesian mixture modelling approach for joint batch correction and classification

Coleman, S.; Castro Dopico, X.; Karlsson Hedestam, G. B.; Kirk, P. D.; Wallace, C.

2022-01-14 bioinformatics 10.1101/2022.01.14.476352 medRxiv

Top 0.1%

4.8%

Show abstract

AO_SCPLOWBSTRACTC_SCPLOWSystematic differences between batches of samples present significant challenges when analysing biological data. Such batch effects are well-studied and are liable to occur in any setting where multiple batches are assayed. Many existing methods for accounting for these have focused on high-dimensional data such as RNA-seq and have assumptions that reflect this. Here we focus on batch-correction in low-dimensional classification problems. We propose a semi-supervised Bayesian generative classifier based on mixture models that jointly predicts class labels and models batch effects. Our model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects. By simultaneously inferring the classification and the batch-correction our method is more robust to dependence between batch and class than pre-processing steps such as ComBat. We explore two choices for the within-class densities: the multivariate normal and the multivariate t. A simulation study demonstrates that our method performs well compared to popular off-the-shelf machine learning methods and is also quick; performing 15,000 iterations on a dataset of 750 samples with 2 measurements each in 11.7 seconds for the MVN mixture model and 14.7 seconds for the MVT mixture model. We further validate our model on gene expression data where cell type (class) is known and simulate batch effects. We apply our model to two datasets generated using the enzyme-linked immunosorbent assay (ELISA), a spectrophotometric assay often used to screen for antibodies. The examples we consider were collected in 2020 and measure seropositivity for SARS-CoV-2. We use our model to estimate seroprevalence in the populations studied. We implement the models in C++ using a Metropolis-within-Gibbs algorithm, available in the R package batchmix. Scripts to recreate our analysis are at https://github.com/stcolema/BatchClassifierPaper.

7

Bayesian inference of power law distributions

Atwal, G. S.; Grigaityte, K.

2019-06-18 bioinformatics 10.1101/664243 medRxiv

Top 0.1%

4.3%

Show abstract

Observed data from many research disciplines, ranging from cellular biology to economics, often follow a particular long-tailed distribution known as a power law. Despite the ubiquity of natural power laws, inferring the exact form of the distribution from sampled data remains challenging. The possible presence of multiple generative processes giving rise to an unknown weighted mixture of distinct power law distributions in a single dataset presents additional challenges. We present a probabilistic solution to these issues by developing a Bayesian inference approach, with Markov chain Monte Carlo sampling, to accurately estimate power law exponents, the number of mixtures, and their weights, for both discrete and continuous data. We determine an objective prior distribution that is invariant to reparameterization of parameters, and demonstrate its effectiveness to accurately infer exponents, even in the low sample limit. Finally, we provide a comprehensive and documented software package, written in Python, of our Bayesian inference methodology, freely available at https://github.com/AtwalLab/BayesPowerlaw.

8

Extrinsic biological stochasticity and technical noise normalization of single-cell RNA sequencing data

Fang, M.; Pachter, L.

2025-05-12 bioinformatics 10.1101/2025.05.11.653373 medRxiv

Top 0.1%

4.3%

Show abstract

The technical noise introduced during single-cell RNA sequencing (scRNA-seq) has led to the use of size factor normalization as a first step prior to data analysis. However, this scaling approach inherently affects extrinsic (between cell) variability of gene expression, which stems from both biological and technical factors. Based on previous models on biological and technical extrinsic noise, we propose a general extrinsic noise model for scRNA-seq to provide a theoretical basis for size factor normalization, thus providing a framework for estimating both biological and technical components of extrinsic noise. We highlight the relationship between normalized gene expression covariance, extrinsic noise, and overdispersion, showing that extrinsic noise explains the baseline overdispersion commonly observed in scRNA-seq data. We validated the technical model by testing the relationship on data from pooled RNA. Interestingly, our model accurately describes mature mRNA counts but not nascent mRNA counts, suggesting the need for an alternative technical model for data derived from nascent transcripts. Using single-cell RNA-seq data, we characterize both biological and technical extrinsic noise and cell size factors estimated using Poisson-like genes. Overall, our model helps clarify common misconceptions and provides insight into the role of extrinsic noise and size factor normalization in scRNA-seq data.

9

Estimating mutual information under measurement error

Ma, C.; Kingsford, C.

2019-11-23 bioinformatics 10.1101/852384 medRxiv

Top 0.1%

4.2%

Show abstract

Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases. As a result, the distribution of the signals is blurred, and the mutual information may be biased when estimated using the blurred measurements. We derive a corrected estimator for mutual information that accounts for the distribution of measurement error. Our corrected estimator is based on the correction of the probability mass function (PMF) or probability density function (PDF, based on kernel density estimation). We prove that the corrected estimator is asymptotically unbiased in the (semi-) discrete case when the distribution of measurement error is known. We show that it reduces the estimation bias in the continuous case under certain assumptions. On simulated data, our corrected estimator leads to a more accurate estimation for mutual information when the sample size is not the limiting factor for estimating PMF or PDF accurately. We compare the uncorrected and corrected estimator on the gene expression data of TCGA breast cancer samples and show a difference in both the value and the ranking of estimated mutual information between the two estimators.

10

Non-parametric Bayesian density estimation for biological sequence space with applications to pre-mRNA splicing and the karyotypic diversity of human cancer

Chen, W.-C.; Zhou, J.; Sheltzer, J. M.; Kinney, J. B.; McCandlish, D. M.

2020-11-27 bioinformatics 10.1101/2020.11.25.399253 medRxiv

Top 0.1%

4.1%

Show abstract

Density estimation in sequence space is a fundamental problem in machine learning that is of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy, i.e. calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates. Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data is plentiful while still maintaining a conservative maximum entropy char-acter in regions of sequence space where data is sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyper-parameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand the accumulation of chromosomal abnormalities during cancer progression.

11

InstaPrism: an R package for fast implementation of BayesPrism

Hu, M.; Chikina, M.

2023-03-10 bioinformatics 10.1101/2023.03.07.531579 medRxiv

Top 0.1%

4.0%

Show abstract

Computational cell-type deconvolution is an important analytic technique for modeling the compositional heterogeneity of bulk gene expression data. A conceptually new Bayesian approach to this problem, BayesPrism, has recently been proposed and has subsequently been shown to be superior in accuracy and robustness against model misspecifications by independent studies. However, given that BayesPrism relies on Gibbs sampling, it is orders of magnitude more computationally expensive than standard approaches. Here, we introduce the InstaPrism algorithm which re-implements BayesPrism in a derandomized framework by replacing the time-consuming Gibbs sampling steps in BayesPrism with a fixed-point algorithm. We demonstrate that the new algorithm is effectively equivalent to BayesPrism while providing a considerable speed advantage. InstaPrism is implemented as a standalone R package with C++ backend and can be accessed from GitHub at https://github.com/humengying0907/InstaPrism.

12

On Multiply Robust Mendelian Randomization (MR2) With Many Invalid Genetic Instruments

Sun, B.; Liu, Z.; Tchetgen Tchetgen, E.

2021-10-26 epidemiology 10.1101/2021.10.21.21265317 medRxiv

Top 0.1%

3.9%

Show abstract

Mendelian randomization (MR) is a popular instrumental variable (IV) approach, in which genetic markers are used as IVs. In order to improve efficiency, multiple markers are routinely used in MR analyses, leading to concerns about bias due to possible violation of IV exclusion restriction of no direct effect of any IV on the out-come other than through the exposure in view. To address this concern, we introduce a new class of Multiply Robust MR (MR2) estimators that are guaranteed to remain consistent for the causal effect of interest provided that at least one genetic marker is a valid IV without necessarily knowing which IVs are invalid. We show that the proposed MR2 estimators are a special case of a more general class of estimators that remain consistent provided that a set of at least k{dagger} out of K candidate instrumental variables are valid, for k{dagger}[≤] K set by the analyst ex ante, without necessarily knowing which IVs are invalid. We provide formal semiparametric theory supporting our results, and characterize the semiparametric efficiency bound for the exposure causal effect which cannot be improved upon by any regular estimator with our favorable robustness property. We conduct extensive simulation studies and apply our methods to a large-scale analysis of UK Biobank data, demonstrating the superior empirical performance of MR2 compared to competing MR methods.

13

The Rayleigh Quotient and Contrastive Principal Component Analysis II

Jackson, K. C.; Carilli, M. T.; Pachter, L.

2026-04-10 bioinformatics 10.64898/2026.04.08.717236 medRxiv

Top 0.1%

3.7%

Show abstract

Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-{rho}PCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-{rho}PCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.

14

Tracking Hematopoietic Stem Cell Evolution In A Wiskott-Aldrich Clinical Trial

Pellin, D.; Biasco, L.; Scala, S.; Di Serio, C.; Wit, E. C.

2022-05-31 bioinformatics 10.1101/2022.05.30.494052 medRxiv

Top 0.1%

3.6%

Show abstract

Hematopoietic Stem Cells (HSC) are the cells that give rise to all other blood cells and, as such, they are crucial in the healthy development of individuals. Wiskott-Aldrich Syndrome (WAS) is a severe disorder affecting the regulation of hematopoietic cells and is caused by mutations in the WASP gene. We consider data from a revolutionary gene therapy clinical trial, where HSC harvested from 3 WAS patients bone marrow have been edited and corrected using viral vectors. Upon re-infusion into the patient, the HSC multiply and differentiate into other cell types. The aim is to unravel the cell multiplication and cell differentiation process, which has until now remained elusive. This paper models the replenishment of blood lineages resulting from corrected HSC via a multivariate, density-dependent Markov process and develops an inferential procedure to estimate the dynamic parameters given a set of temporally sparsely observed trajectories. Starting from the master equation, we derive a system of non-linear differential equations for the evolution of the first- and second-order moments over time. We use these moment equations in a generalized method-of-moments framework to perform inference. The performance of our proposal has been evaluated by considering different sampling scenarios and measurement errors of various strengths using a simulation study. We also compared it to another state-of-the-art approach and found that our method is statistically more efficient. By applying our method to the Wiskott-Aldrich Syndrome gene therapy data we found strong evidence for a myeloid-based developmental pathway of hematopoietic cells where fates of lymphoid and myeloid cells remain coupled even after the loss of erythroid potential. All code used in this manuscript can be found in the online Supplement, and the latest version of the code is available at github. com/dp3ll1n/SLCDP_v1.0.

15

Spatially aligned random partition models on spatially resolved transcriptomics data

Duan, Y.; Guo, S.; Yan, H.; Wang, W.; Mueller, P.

2025-04-22 bioinformatics 10.1101/2025.04.16.649218 medRxiv

Top 0.1%

3.6%

Show abstract

We propose spatially aligned random partition (SARP) models for clustering multiple types of experimental units, incorporating dependence in a subvector of the cluster-specific parameters, e.g., a subvector of spatial information, as in the motivating application. The approach is developed for inference about co-localization of immune, stromal, and tumor cell sub-populations. The aim is to understand the recruitment of immune and stromal cell subtypes by tumor cells, formalized as spatial dependence of the corresponding homogeneous cell subpopulations. This is achieved by constructing Bayesian nonparametric random partition models for the different types of cells, with a hierarchically structured prior introducing the desired dependence. Specifically, we use Pitman-Yor priors and add dependence in the base measure for spatial features, while leaving the base measure corresponding to gene expression features a priori independent across different types of cells. Details of the model construction are designed to lead to a convenient MCMC algorithm for posterior inference. Simulation studies show favorable performance in identifying co-localization between types of cells. We apply the proposed approach with colorectal cancer (CRC) data and discover subtypes of immune and stromal cells that are spatially aligned with specific tumor regions.

16

A framework to efficiently smooth L1 penalties for linear regression

Hahn, G.; Lutz, S. M.; Laha, N.; Lange, C.

2020-09-19 bioinformatics 10.1101/2020.09.17.301788 medRxiv

Top 0.1%

3.6%

Show abstract

Penalized linear regression approaches that include an L1 term have become an important tool in statistical data analysis. One prominent example is the least absolute shrinkage and selection operator (Lasso), though the class of L1 penalized regression operators also includes the fused and graphical Lasso, the elastic net, etc. Although the L1 penalty makes their objective function convex, it is not differentiable everywhere, motivating the development of proximal gradient algorithms such as Fista, the current gold standard in the literature. In this work, we take a different approach based on smoothing in a fixed parameter setting (the problem size n and number of parameters p are fixed). The methodological contribution of our article is threefold: (1) We introduce a unified framework to compute closed-form smooth surrogates of a whole class of L1 penalized regression problems using Nesterov smoothing. The surrogates preserve the convexity of the original (unsmoothed) objective functions, are uniformly close to them, and have closed-form derivatives everywhere for efficient minimization via gradient descent; (2) We prove that the estimates obtained with the smooth surrogates can be made arbitrarily close to the ones of the original (unsmoothed) objective functions, and provide explicitly computable a priori error bounds on the accuracy of our estimates; (3) We propose an iterative algorithm to progressively smooth the L1 penalty which increases accuracy and is virtually free of tuning parameters. The proposed methodology is applicable to a large class of L1 penalized regression operators, including all the operators mentioned above. Although the resulting estimates are typically dense, sparseness can be enforced again via thresholding. Using simulation studies, we compare our framework to current gold standards such as Fista, glmnet, gLasso, etc. Our results suggest that our proposed smoothing framework provides predictions of equal or higher accuracy than the gold standards while keeping the aforementioned theoretical guarantees and having roughly the same asymptotic runtime scaling.

17

Jointly modeling prevalence, sensitivity and specificity for optimal sample allocation

Larremore, D. B.; Fosdick, B. K.; Zhang, S.; Grad, Y. H.

2020-05-26 immunology 10.1101/2020.05.23.112649 medRxiv

Top 0.1%

3.6%

Show abstract

The design and interpretation of prevalence studies rely on point estimates of the performance characteristics of the diagnostic test used. When the test characteristics are not well defined and a limited number of tests are available, such as during an outbreak of a novel pathogen, tests can be used either for the field study itself or for additional validation to reduce uncertainty in the test characteristics. Because field data and validation data are based on finite samples, inferences drawn from these data carry uncertainty. In the absence of a framework to balance those uncertainties during study design, it is unclear how best to distribute tests to improve study estimates. Here, we address this gap by introducing a joint Bayesian model to simultaneously analyze lab validation and field survey data. In many scenarios, prevalence estimates can be most improved by apportioning additional effort towards validation rather than to the field. We show that a joint model provides superior estimation of prevalence, as well as sensitivity and specificity, compared with typical analyses that model lab and field data separately.

18

BayICE: A hierarchical Bayesian deconvolution model with stochastic search variable selection

Tai, A.-S.; Tseng, G.; Hsieh, W.-P.

2019-08-12 genomics 10.1101/732743 medRxiv

Top 0.1%

3.5%

Show abstract

Gene expression deconvolution is a powerful tool for exploring the microenvironment of complex tissues comprised of multiple cell groups using transcriptomic data. Characterizing cell activities for a particular condition has been regarded as a primary mission against diseases. For example, cancer immunology aims to clarify the role of the immune system in the progression and development of cancer through analyzing the immune cell components of tumors. To that end, many deconvolution methods have been proposed for inferring cell subpopulations within tissues. Nevertheless, two problems limit the practicality of current approaches. First, all approaches use external purified data to preselect cell type-specific genes that contribute to deconvolution. However, some types of cells cannot be found in purified profiles and the genes specifically over- or under-expressed in them cannot be identified. This is particularly a problem in cancer studies. Hence, a preselection strategy that is independent from deconvolution is inappropriate. The second problem is that existing approaches do not recover the expression profiles of unknown cells present in bulk tissues, which results in biased estimation of unknown cell proportions. Furthermore, it causes the shift-invariant property of deconvolution to fail, which then affects the estimation performance. To address these two problems, we propose a novel deconvolution approach, BayICE, which employs hierarchical Bayesian modeling with stochastic search variable selection. We develop a comprehensive Markov chain Monte Carlo procedure through Gibbs sampling to estimate cell proportions, gene expression profiles, and signature genes. Simulation and validation studies illustrate that BayICE outperforms existing deconvolution approaches in estimating cell proportions. Subsequently, we demonstrate an application of BayICE in the RNA sequencing of patients with non-small cell lung cancer. The model is implemented in the R package \"BayICE\" and the algorithm is available for download.

19

RobMixReg: an R package for robust, flexible and high dimensional mixture regression

Chang, W.; Wan, C.; Yu, C.; Yao, W.; Zhang, C.; Cao, S.

2020-08-04 bioinformatics 10.1101/2020.08.02.233460 medRxiv

Top 0.1%

3.0%

Show abstract

MotivationMixture regression has been widely used as a statistical model to untangle the latent subgroups of the sample population. Traditional mixture regression faces challenges when dealing with: 1) outliers and versatile regression forms; and 2) the high dimensionality of the predictors. Here, we develop an R package called RobMixReg, which provides comprehensive solutions for robust, flexible as well as high dimensional mixture modeling. Availability and ImplementationRobMixReg R package and associated documentation is available at CRAN: https://CRAN.R-project.org/package=RobMixReg.

20

High-dimensional Bayesian phenotype classification and model selection using genomic predictors

Linder, D. F.; Panchal, V.

2019-09-23 bioinformatics 10.1101/778472 medRxiv

Top 0.1%

2.8%

Show abstract

MotivationIn this paper we describe a Bayesian hierarchical model termed PMMLogit for classification and model selection in high-dimensional settings with binary phenotypes as outcomes. Posterior computation in the logistic model is known to be computationally demanding due to its non-conjugacy with common priors. We combine a Polya-Gamma based data augmentation strategy and use recent results on Markov chain Monte-Carlo (MCMC) techniques to develop an efficient and exact sampling strategy for the posterior computation. We use the resulting MCMC chain for model selection and choose the best combination(s) of genomic variables via posterior model probabilities. Further, a Bayesian model averaging (BMA) approach using the posterior mean, which averages across visited models, is shown to give superior prediction of phenotypes given genomic measurements.\n\nResultsUsing simulation studies, we compared the performance of the proposed method with other popular methods. Simulation results show that the proposed method is quite effective in selecting the true model and has better estimation and prediction accuracy than other methods. These observations are consistent with theoretical results that have been developed in the statistics literature on optimality for this class of priors. Application to two well-known datasets on colon cancer and leukemia identified genes that have been previously reported in the clinical literature to be related to the disease outcomes.\n\nAvailabilitySource code is publicly available on GitHub at https://github.com/v-panchal/PMML.\n\nContactdlinder@augusta.edu\n\nSupplementary informationSupplementary data are available online.